This report develops two generalizations of the standard Linear Predictive Coding (LPC) implementation of a narrow band speech compression system. The purpose of each method is to improve the speech quality that is available from a standard LPC system. Attention is focused primarily upon the pitch excited system; therefore, the improvements considered concern the estimation of the reflection coefficients and the pitch period. Specifically, a parameter filtering algorithm is developed for dynamically smoothing the reflection coefficients, both to increase naturalness in synthetic speech and to eliminate the possibility of synthesis filter instabilities. Secondly, a new method, STREAK, is developed for calculating the k-parameters of an LPC inverse filtering algorithm. New values for each k-parameter are calculated at each sample point directly using the lattice formulation of the inverse filter model. It is shown that this technique can be used to improve a pitch detection scheme based upon the autocorrelation of the inverse filter output sequence.
INTRODUCTION

This report is concerned with methods for improving the quality of narrow band synthetic speech generated by an LPC analysis-synthesis system. As such, it is assumed that the reader is familiar with the basic theory behind Linear Predictive Coding applied to speech. Numerous references are available which describe the techniques, advantages, and disadvantages of LPC [1] [2] [3]. The intent of this report is to introduce and develop two new methods for improving speech quality by enlarging or replacing parts of so-called standard approaches. Therefore, only a brief description of LPC will be presented so as to provide a foundation to which these new techniques can be referenced. There are many ways of improving speech quality. This report focuses primarily on two major areas: one, improved estimation of reflection coefficients and two, improved estimation of pitch. Other avenues for improvement, such as parameter quantization and coding [4], fixed point implementation [5], and mode of transmission [6], although extremely important, will not be addressed.

Following a brief description of LPC, the report is organized in three major parts. Part one describes the various procedures which comprise the complete analysis-synthesis system. Part two describes a technique for the improved estimation of reflection coefficients using a minimum variance a priori least squares estimator. Part three describes a new method for calculating the reflection coefficients, or k-parameters, associated with the lattice form of the inverse filter and shows how this procedure for inverse filtering can be used for improved pitch tracking estimates.

Narrow  Band  Speech  Compression 

Digital speech transmission using conventional pulse code modulation requires channel rates on the order of 60,000 bits per second. In order to reduce this rate to what might be called a narrow band speech compression system (typically 4000 bps or less), it is necessary to parameterize the speech waveform into a smaller (typically 13 to 20) set of slowly varying parameters. Estimates of these parameters are computed at some prescribed analysis rate, typically from 40 to 200 times per second. The parameters are then quantized, encoded, and sent to the synthesizer across a transmission channel at a prescribed rate. There they are decoded and supplied to a synthesizer algorithm which generates a synthetic speech waveform that should sound acceptably like the original. Thus, if the speech waveform were characterized by, say, 13 parameters which could be coded to an average of 5 bits each and updated and sent every 20 ms, one would have a speech compression system requiring 3250 bps. The major components of such a system consist of an analyzer, a coder, a decoder, and a synthesizer. The standard parameters estimated in the analyzer consist of the signal energy, the voiced-unvoiced decision, the pitch period, and the set of vocal tract descriptors. The vocal tract is assumed to be accurately modeled by a digital filter defined as a ratio of polynomials. If the filter is assumed to have only poles, then linear predictive coding can be used to estimate the filter parameters, as well as determine energy, pitch, and voicing.
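The channel rate quoted in the example follows from simple arithmetic: 13 parameters at 5 bits each give 65 bits per frame, and 3250 bps corresponds to 50 such frames per second (one frame every 20 ms). A minimal sketch of that calculation is given here; the function name and argument names are illustrative and do not come from the report.

```python
def channel_bit_rate(num_params, bits_per_param, frames_per_second):
    """Bits per second needed to send one parameter set per channel frame."""
    bits_per_frame = num_params * bits_per_param
    return bits_per_frame * frames_per_second


# 13 parameters, 5 bits each on average, 50 channel frames per second.
rate_bps = channel_bit_rate(13, 5, 50)
```

The same function shows how quickly the rate grows with the update rate: sending the same 65-bit frame 100 times per second would already exceed the 4000 bps narrow band target.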
Speech Analysis Using Linear Prediction

A linear prediction analysis of speech assumes that the nth speech sample, $s_n$, can be predicted approximately by a linear combination of the preceding p samples. Thus the approximation is given by

$$\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$$

where $\{a_i,\ i = 1, 2, \ldots, p\}$ is the set of real constants called predictor coefficients which are to be estimated. Values for these coefficients are found by minimizing the sum of the squares of the prediction error sequence, $e_n$, where

$$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i} .$$

Thus values for the predictor coefficients are determined using a least squares estimator having as a loss function to be minimized

$$E = \sum_{n} e_n^2 = \sum_{n} \Big( s_n - \sum_{i=1}^{p} a_i s_{n-i} \Big)^2 .$$

Note, as shown in Chapter II, this loss function E can be expanded to include a priori information, leading to a smoother minimum variance estimate.


There are two basic approaches to linear prediction analysis. They vary according to the range of n used in defining the loss function E, and the definition of the signal $s_n$ in that range. When using the covariance method the signal is defined over a finite range $-p < n < N-1$, and minimizing E leads to the set of normal equations [7]

$$\sum_{i=1}^{p} a_i \phi_{ij} = \phi_{j,0}, \qquad j = 1, 2, \ldots, p$$

where

$$\phi_{ij} = \sum_{n=1}^{N} s_{n-i} s_{n-j} .$$

The coefficient matrix $[\phi_{ij}]$ is a positive definite covariance matrix, and the system of equations can be efficiently solved using a triangularization method sometimes called the Cholesky decomposition [8] (see Chapter II). When using the autocorrelation method, the signal $s_n$ is multiplied by a window of length N such that $s_n = 0$ for $n < 0$ and $n > N-1$. The range of n is assumed infinite, and minimization of E leads to the normal equations [7]

$$\sum_{i=1}^{p} a_i r_{|i-j|} = r_j, \qquad j = 1, 2, \ldots, p$$

where

$$r_i = \sum_{n=0}^{N-1} s_n s_{n+i} .$$

The coefficient matrix $[r_{|i-j|}]$ is a positive definite Toeplitz matrix, and the system of equations can be efficiently solved using Levinson's recursion [9].
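As an illustration of the autocorrelation method, the following sketch computes the short-term autocorrelation of a windowed frame and solves the Toeplitz normal equations with a Levinson-type recursion. It is a minimal example, not the report's implementation; the function names are hypothetical, and the sign convention (the reflection coefficient of stage m equals the last predictor coefficient of that stage, with $s_n \approx \sum_i a_i s_{n-i}$) is an assumption, since conventions differ between references.

```python
def autocorrelation(s, p):
    """r[i] = sum_n s[n] * s[n + i] over a windowed frame (zero outside it)."""
    n_samples = len(s)
    return [sum(s[n] * s[n + i] for n in range(n_samples - i))
            for i in range(p + 1)]


def levinson(r, p):
    """Solve sum_i a_i r_|i-j| = r_j for j = 1..p.

    Returns (a, k, err): predictor coefficients a_1..a_p, the per-stage
    reflection coefficients k_1..k_p, and the final prediction error energy.
    """
    a = [0.0] * (p + 1)  # a[0] is unused so that indices match the equations
    k = [0.0] * (p + 1)
    err = r[0]
    for m in range(1, p + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k[m] = acc / err
        prev = a[:]
        a[m] = k[m]
        for i in range(1, m):
            a[i] = prev[i] - k[m] * prev[m - i]
        err *= 1.0 - k[m] ** 2
    return a[1:], k[1:], err
```

With a nonzero frame the recursion yields reflection coefficients of magnitude less than one in exact arithmetic, which is the stability property that makes these parameters attractive for transmission.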
Itakura [10] initially showed that a linear prediction analysis can be formulated in terms of another equivalent set of parameters $\{k_i,\ i = 1, 2, \ldots, p\}$ called PARCOR, or reflection, coefficients. It has been shown that these k-parameters are well suited as transmission parameters for a narrow band speech compression system, since they exhibit superior quantization properties [4], [10] and stability of the synthesis filter is guaranteed if $|k_i| < 1$ [11].

The linear prediction model also provides an effective method for detecting the pitch period. If the all-pole model defined by linear prediction accurately represents the vocal tract transfer function, together with the radiation and glottal volume flow effects, then the output of the inverse filter, that is, the error signal $e_n$, should resemble an impulse-like driving function having a period equal to the pitch for voiced speech. Absence of periodicity would imply unvoiced speech.

Two approaches to pitch detection using inverse filtering are addressed in this report. The first concerns the standard block analysis approach, SIFT [12], and is described in Chapter I. The second uses a point by point analysis, STREAK, and is described in Chapter III.

The energy needed to appropriately scale the synthesized output waveform can be obtained from the coefficient matrix (either $\phi_{0,0}$ or $r_0$), or by using the energy of the error signal itself, $\sum_{n=0}^{N-1} e_n^2$.


Methods  for  Improving  Speech  Quality 
Considerable effort is currently being devoted to methods for improving the quality of synthesized speech generated from a narrow band compression system. Already noted are the studies in parameter quantization, fixed point implementation, and improved modes of transmission. Other techniques exist, such as more elaborate vocal tract models and analysis [13], [14], as well as voice excited and error excited synthesizers [15]. However, implementing such techniques introduces the added drawbacks of increased complexity and computation and increased channel bandwidth. If the compression system is expected to operate in real time, then complexity and computation must be minimized, and if the bandwidth is to be constrained to a rate less than 4000 bps, then the more elaborate excitation sequences must be simplified to a unit impulse driven sequence. In order to conform to these constraints of minimizing computation rate and channel bandwidth, the techniques discussed in this report focus primarily upon the improved estimation of reflection coefficients and the pitch period using methods which are uncomplicated enough not to prohibit real time implementation or narrow band transmission.
I. DESCRIPTION OF THE ANALYSIS-SYNTHESIS SYSTEM


General  Data  Flow 

A diagram showing the various stages in the entire system is shown in Figure 1.1. The original analog speech waveform is first low pass filtered by an anti-aliasing filter having cutoff frequency $f_c$, sampled at a rate $f_s$, and quantized to q bits per sample. The digitized waveform is then stored on magnetic tape or disk.

The analysis portion of the system consists of three parts: (1) the data control routine for determining how the data is analyzed and transmitted to the synthesizer; (2) the actual analysis routines for estimating the vocal tract parameters: reflection coefficients, pitch, voicing, and energy; and (3) the coding routines for optimally quantizing the analysis parameters for channel transmission.

The coded parameters are transmitted through the channel at a constant rate called the channel frame rate.

The synthesis portion of the system consists of two parts: (1) a routine for decoding the transmitted parameters; and (2) the synthesis routine for recursively generating synthetic speech. These samples are also stored on magnetic tape or disk.

A synthesized analog waveform is obtained from the D to A conversion of the processed samples, which is then low pass filtered by an anti-imaging filter having cutoff frequency $f_c$.
Data Management (The Data Control Routine)

The  approach  taken  for  analyzing  the  incoming  speech  waveform  is 
to  separate  it  into  possibly  overlapping  data  sections  called  analysis 
frames,  and  extract  a set  of  analysis  parameters  from  each  frame.  At 
the  completion  of  each  analysis,  the  frame  is  shifted  down  the  time 
line  by  loading  new  samples  into  the  front  end  and  dropping  old  samples 
off  the  back  end.  Thus  one  can  view  this  approach  as  extracting  the 
analysis parameters from that portion of the waveform which lies under
a sliding  analysis  window. 

Advancing  the  Analysis  Frame 

The approach used is to advance the analysis frame "pitch synchronously". Specifically, the analysis frame is shifted by an amount equal to the last estimated pitch period. This policy is followed except when the pitch period becomes less than a preset minimum jump distance, in which case the frame is shifted by a multiple of that pitch period. For example, if the minimum jump distance is 5 ms and the pitch period is 4 ms, the frame is advanced by 8 ms.
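The advance rule can be sketched in a few lines. The report states only that "a multiple" of the pitch period is used when the period falls below the minimum jump distance; taking the smallest multiple that reaches that distance is an assumption made here, chosen because it reproduces the 4 ms / 5 ms / 8 ms example. The function name is illustrative.

```python
import math


def frame_advance_ms(pitch_period_ms, min_jump_ms):
    """Shift by one pitch period, or by the smallest whole multiple of it
    that reaches the minimum jump distance (assumed rule)."""
    if pitch_period_ms >= min_jump_ms:
        return pitch_period_ms
    multiple = math.ceil(min_jump_ms / pitch_period_ms)
    return multiple * pitch_period_ms
```

For a 4 ms pitch period and a 5 ms minimum jump this gives two periods, that is, 8 ms; a 6 ms pitch period is already long enough and is used unchanged.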

Size  of  the  Analysis  Frame 

The  size  of  the  analysis  frame  is  dictated  by  the  amount  of  data 
needed  to  extract  estimates  of  the  pitch  and  reflection  coefficients 
accurately.  The  analysis  frame  size  for  estimating  pitch  is  set  at 
40 ms, whereas the frame size for estimating the reflection coefficients is set at 16 ms when using the covariance method and at 32 ms
when  using  the  autocorrelation  method.  Thus  the  overall  analysis 
There are two different analysis frame rates used in the analysis.
For  coefficient  estimation  the  frame  rate  is  pitch  synchronous  as 
described  above.  As  is  described  in  Chapter  II,  this  higher  rate  is 
necessary  for  smoothing  the  reflection  coefficients. 

However,  the  analysis  portion  of  the  system  does  no  smoothing  on 
the  pitch  estimates  and  therefore  they  need  only  be  calculated  as  often 
as  is  dictated  by  the  channel  frame  rate.  That  is,  new  estimates  of 
pitch  and  voicing  are  needed  only  as  often  as  they  must  be  transmitted 
to  the  synthesizer.  (Typically  the  channel  frame  rate  is  set  at  50 
frames/second  or  less.)  Thus  the  pitch  extraction  analysis  rate  is 
set up to be multiple-pitch synchronous. A new pitch estimate is computed when the analysis frame has been shifted in time to a point
required  for  a new  pitch  estimate  to  be  transmitted  to  the  synthesizer. 
For  example,  if  the  channel  frame  rate  is  set  at  50  frames  per  second, 
then  new  pitch  estimates  are  required  every  20  ms.  If  the  previous 
pitch  period  was  found  to  be  5 ms.  then  the  next  pitch  estimate  will  be 
computed  after  four  shifts  of  the  analysis  frame. 
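The scheduling of pitch estimates can be illustrated with a small calculation. The sketch assumes that the number of shifts is the smallest whole number of pitch periods that covers one channel frame interval; the report gives only the 50 frames per second, 5 ms example, so the rounding rule for periods that do not divide the interval evenly is an assumption. The function name is hypothetical.

```python
import math


def shifts_until_pitch_update(channel_frame_rate_hz, pitch_period_ms):
    """Number of pitch-synchronous shifts before a new pitch estimate is due."""
    channel_period_ms = 1000.0 / channel_frame_rate_hz
    return math.ceil(channel_period_ms / pitch_period_ms)
```

At 50 frames per second the channel interval is 20 ms, so a 5 ms pitch period gives four shifts, matching the example in the text.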

In summary, a data control routine specifies the length of the analysis frame and determines how it is shifted down the time line and which analysis parameters are to be computed at each shift. The analysis is pitch synchronous. New reflection coefficients are computed and smoothed at each shift. The reflection coefficients which are transmitted to the synthesizer consist of those values present at the points required for channel transmission. Thus, if the channel frame rate is less than the pitch rate, the transmitted reflection coefficients represent a down-sampled version of the coefficients being estimated. Pitch and voicing are computed and transmitted at multiple shifts as dictated by the channel frame rate. The length of each shift is set equal to the last pitch period estimated.

Reflection  Coefficient  Estimation  and  Smoothing 
A diagram showing the various parts of the coefficient estimation routine is given in Figure 1.2. Except for the smoothing algorithm which is appended at the end, the operations are similar if not identical to standard methods for estimating reflection coefficients. The routine can be used to estimate reflection coefficients using either the covariance method [1] (Atal) or the autocorrelation method [2] (Markel, Itakura). Which method is used determines the form of the linear system of equations which is to be solved.

Coefficient  Analysis  Frame  Length 
The overall analysis frame size specified by the data control routine is considerably larger (typically 40 ms) than the coefficient analysis frame size to be used for estimating the reflection coefficients (either 16 or 32 ms). Thus the first step in this routine is to extract a subset of length N from the center of the analysis frame. This subset, the coefficient analysis frame, is used to compute an energy term and reflection coefficients.
[Figure 1.2. Reflection coefficient estimation]
Energy Calculation and Silence Detection

The zeroth correlation term,

$$\phi_{0,0} = \sum_{n=1}^{N} s_n^2 \qquad \text{or} \qquad r_0 = \sum_{n=0}^{N-1} s_n^2 ,$$

is initially computed, and from it the energy estimate R1 is computed. R1 is then compared against a threshold value, THRES, to determine if the waveform to be analyzed represents a silent region. For 12-bit samples, values of R1 less than a THRES equal to 12 imply that the analysis frame represents silence, and the routine is exited.

Matrix  Loading  (Covariance  Method) 

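The linear system of equations to be used for estimating the reflection coefficients was developed in Reference [7].

The linear system is given by:

$$
\begin{bmatrix}
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,p} \\
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,p} \\
\vdots & & & \vdots \\
\phi_{p,1} & \phi_{p,2} & \cdots & \phi_{p,p}
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} \phi_{1,0} \\ \phi_{2,0} \\ \vdots \\ \phi_{p,0} \end{bmatrix}
\tag{1.3}
$$

where

$$\phi_{ij} = \sum_{n=1}^{N} s_{n-i} s_{n-j}$$

and $a_i$, $i = 1, 2, \ldots, p$ = maximum likelihood or classical estimate of the predictor coefficients.

Initially $\phi_{0,i}$, $i = 0, 1, \ldots, p$, are computed, and from these values the remaining elements of the matrix are found using the standard method:

$$\phi_{i,i} = \phi_{i-1,i-1} + s_{-i+1}^2 - s_{N+1-i}^2, \qquad i = 2, \ldots, p$$

$$\phi_{i+1,i} = \phi_{i,i-1} + s_{-i}\, s_{-i+1} - s_{N-i}\, s_{N-i+1}, \qquad i = 1, 2, \ldots, p-1$$

$$\phi_{i,i+1} = \phi_{i+1,i}$$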
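The recursive loading can be checked numerically. The sketch here uses the general form of the update, $\phi_{i+1,j+1} = \phi_{i,j} + s_{-i}s_{-j} - s_{N-i}s_{N-j}$, which follows from shifting the summation index in the definition of $\phi_{ij}$ and contains the diagonal and sub-diagonal recursions as special cases. The storage layout (a list `x` holding $s_{1-p}, \ldots, s_N$) and the function names are assumptions made for the example.

```python
def phi_direct(x, p, N, i, j):
    """phi_ij = sum_{n=1}^{N} s_{n-i} s_{n-j}, with x[n + p - 1] holding s_n."""
    def s(n):
        return x[n + p - 1]
    return sum(s(n - i) * s(n - j) for n in range(1, N + 1))


def phi_recursive(x, p, N):
    """Fill the (p+1) x (p+1) covariance matrix from its first row."""
    def s(n):
        return x[n + p - 1]
    phi = [[0.0] * (p + 1) for _ in range(p + 1)]
    for i in range(p + 1):  # phi_{0,i} computed directly
        phi[0][i] = phi[i][0] = sum(s(n) * s(n - i) for n in range(1, N + 1))
    for i in range(p):      # every other element by one add and one subtract
        for j in range(i, p):
            value = phi[i][j] + s(-i) * s(-j) - s(N - i) * s(N - j)
            phi[i + 1][j + 1] = phi[j + 1][i + 1] = value
    return phi
```

Only the first row costs N multiplications per element; each remaining element needs two products, which is the reason the recursive form is preferred for real-time use.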

Matrix  Loading  (Autocorrelation  Method) 

The linear system of equations used to estimate the reflection coefficients for the autocorrelation method is given by [7]

$$
\begin{bmatrix}
r_0 & r_1 & \cdots & r_{p-1} \\
r_1 & r_0 & \cdots & r_{p-2} \\
\vdots & & & \vdots \\
r_{p-1} & r_{p-2} & \cdots & r_0
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix}
\tag{1.4}
$$

WHERE

$$r_n = \sum_{i=0}^{N-1-n} s_i s_{i+n}$$

$a_i$, $i = 1, 2, \ldots, p$ = maximum likelihood or the classical estimate of the predictor coefficients.

$s_n$, $n = 0, 1, \ldots, N-1$ = windowed speech samples (usually using a Hamming type window).

Makhoul [3] and Markel [2] have shown that preemphasizing the input speech improves the synthetic speech quality. With that implementation, the samples used to form $r_n$ are given by

$$s'_n = (s_n - c\, s_{n-1})\, W_n$$

WHERE $s_n$ = speech samples

$W_n$ = window samples

c = preemphasis coefficient, which is sampling frequency dependent. See Markel [2].
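A minimal sketch of the preemphasis and windowing step is shown here. The Hamming window formula is the usual textbook one; the treatment of the sample before the frame (taken as zero) and the value of c used in any call are assumptions for illustration, since the report defers the choice of c to Markel [2].

```python
import math


def hamming(N):
    """Standard Hamming window of length N (N >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1))
            for n in range(N)]


def preemphasize_and_window(s, c):
    """Return (s_n - c * s_{n-1}) * W_n, taking the sample before the
    frame to be zero (an assumption of this sketch)."""
    w = hamming(len(s))
    out = []
    previous = 0.0
    for n, sample in enumerate(s):
        out.append((sample - c * previous) * w[n])
        previous = sample
    return out
```

The resulting samples would then be passed to the autocorrelation sum that forms $r_n$.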


An efficient method for calculating the short term correlation coefficients $r_n$ has been developed by both Blankenship [16] and Pfiefer [17].



Coefficient  Solution  from  Linear  Equations 
The maximum likelihood, unweighted least squares estimate of the reflection coefficients is obtained from the linear system of equations using the Cholesky decomposition method [3] (Mitsui). In matrix notation let equation 1.3 or 1.4 be represented as:

$$H^T H \alpha = H^T y \tag{1.5}$$

As can be shown,

$$H^T H = L D U, \qquad L = U^T, \qquad D = \text{diagonal matrix} \tag{1.6}$$

where L is a nonsingular triangular matrix obtained from the Cholesky decomposition. Substituting gives:

$$L D U \alpha = H^T y \tag{1.7}$$

or

$$L D\, k_{ML} = H^T y \tag{1.8}$$

$$U \alpha = k_{ML} \tag{1.9}$$

WHERE $k_{ML}$ = p x 1 vector of maximum likelihood reflection coefficients. The $k_{ML}$ parameters represent those reflection coefficients obtained using the classical unweighted least squares solution. They can now be smoothed using some method such as the a priori least squares smoothing technique. If no smoothing is to be used, the analysis, except for stability checks and the error energy calculation, has been completed.
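The two-stage solution can be sketched with a small $LDL^T$ factorization. This is a generic illustration of the decomposition, not the report's routine: the matrix `A` stands for $H^T H$, `b` for $H^T y$, the returned `z` is the intermediate vector that equations 1.8 and 1.9 call $k_{ML}$, and `x` is the predictor vector. All names are chosen for the example.

```python
def ldl_decompose(A):
    """Factor a symmetric positive definite A as L D L^T (unit lower L)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = A[j][j] - sum(L[j][m] ** 2 * D[m] for m in range(j))
        for i in range(j + 1, n):
            partial = sum(L[i][m] * L[j][m] * D[m] for m in range(j))
            L[i][j] = (A[i][j] - partial) / D[j]
    return L, D


def solve_ldl(A, b):
    """Solve A x = b via L D z = b, then U x = z with U = L^T."""
    n = len(A)
    L, D = ldl_decompose(A)
    z = [0.0] * n
    for i in range(n):  # forward substitution through L D
        z[i] = (b[i] - sum(L[i][m] * D[m] * z[m] for m in range(i))) / D[i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution through L^T
        x[i] = z[i] - sum(L[m][i] * x[m] for m in range(i + 1, n))
    return x, z
```

Because the intermediate vector falls out of the forward substitution, no extra work is needed to obtain it alongside the predictor coefficients.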

Reflection  Coefficient  Smoothing 

This portion of the algorithm represents a primary contribution of this report and is discussed in considerable detail in the next chapter.
A term which is used as part of a secondary criterion for the voiced-unvoiced decision is the energy of the prediction error sequence. Define the error sequence as

$$e_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$$

Then it can be shown that

$$EV = \sum_{n=1}^{N} e_n^2 = \phi_{0,0} - \sum_{i=1}^{p} a_i \phi_{i,0} \qquad \text{(Covariance)}$$

$$EV = \sum_{n=1}^{N} e_n^2 = r_0 - \sum_{i=1}^{p} a_i r_i \qquad \text{(Autocorrelation)}$$

From this energy term is computed a ratio term which is used as a secondary voiced-unvoiced decision criterion (Atal [1]). Define


RATIO 


Then, assuming 14-bit speech samples, for RATIO less than $0.7 \times 10^8$ the analysis frame is defined to be unvoiced.

Stability  Check 

The reflection coefficients are checked for stability before exiting the routine. Any reflection coefficient having a magnitude greater than or equal to 1 is redefined to have a magnitude of 0.97. Using the autocorrelation method guarantees stability, assuming floating point implementation [11].
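The stability check amounts to a clamp on each coefficient. The sketch here keeps the sign of the offending coefficient when resetting its magnitude; the report specifies only the new magnitude, so preserving the sign is an assumption, and the function name is illustrative.

```python
import math


def stabilize(k, limit=0.97):
    """Reset any reflection coefficient with |k_i| >= 1 to magnitude `limit`,
    keeping its sign (assumed); other coefficients pass through unchanged."""
    return [math.copysign(limit, ki) if abs(ki) >= 1.0 else ki for ki in k]
```

Coefficients already inside the unit interval are left untouched, so the check has no effect on frames that were stable to begin with.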
Pitch and Voicing Detection (Block Analysis Approach)

A diagram showing the various parts of the pitch and voicing detection routine is shown in Figure 1.3. This routine is called at a rate defined by the channel frame rate. Thus, if the channel frame rate is set at 50 frames per second, pitch and voicing values will be estimated 50 times per second. The time interval between estimates is defined as that multiple of the previous pitch period which shifts the analysis frame into the next 20 ms interval.

The  estimate  of  the  pitch  period  is  found  by  autocorrelating  on 
the  prediction  error  signal.  This  method  is  similar  to  Markel's  SIFT 
algorithm  [12]  and  Itakura's  modified  autocorrelation  method  [18]. 


Digital Low-Pass Filtering and Down-Sampling
From the data control routine a 40 ms analysis frame is extracted from the speech waveform. This analysis frame is then digitally low-pass filtered with a three pole Chebyshev 2-dB ripple filter having a 3-dB cutoff at 750 Hz for an 8 kHz sampling rate and at 600 Hz for a 6.4 kHz sampling rate. This signal is then down-sampled four to one and passed on to the pitch detection routine.


Correlation  on  Error  Signal  and  Pitch  Estimate 
The down-sampled signal is windowed with a Hamming window, and the autocorrelation W(n) of the error signal e(n) is formed using a third order predictor. Define

$$e_n = s_n - \sum_{i=1}^{3} a_i s_{n-i}, \qquad n = 1, \ldots, N \tag{1.14}$$

as the error signal, and

$$W(n) = \sum_{i=1}^{N} e_i e_{i+n}, \qquad n = 0, 1, \ldots, \text{maximum pitch period} \tag{1.15}$$

as its short term autocorrelation, where $s_n$ denotes the down-sampled, windowed speech samples. Chapter III discusses a method for generating the error sequence $e_n$ using the STREAK algorithm.

The W(n) sequence is searched between the limits of four and forty for its maximum value. The index M having the largest value after interpolation defines the pitch period estimate. The magnitude of W(M)/W(0) is used for voicing detection.
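The peak search can be sketched as follows. The example takes an error sequence as input, forms its short-term autocorrelation, and searches lags four through forty (in down-sampled samples) for the largest value. The interpolation step mentioned in the text is omitted, and the function names are hypothetical, so this is a simplified illustration rather than the routine used in the report.

```python
def error_autocorrelation(e, max_lag):
    """W(n) = sum_i e_i * e_{i+n}, with the sequence taken as zero past its end."""
    N = len(e)
    return [sum(e[i] * e[i + n] for i in range(N - n))
            for n in range(max_lag + 1)]


def pitch_from_error(e, min_lag=4, max_lag=40):
    """Return (M, W(M)/W(0)): the lag of the largest autocorrelation value in
    [min_lag, max_lag] and the normalized peak used for voicing detection."""
    W = error_autocorrelation(e, max_lag)
    M = max(range(min_lag, max_lag + 1), key=lambda n: W[n])
    return M, W[M] / W[0]
```

For an idealized error signal consisting of unit impulses every 10 samples, the search returns a lag of 10 with a normalized peak close to one, while a noise-like error signal would give a much smaller peak.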

Voicing  Decision 

The correlation detector will have a magnitude between zero and one. If the correlation detector has a value greater than or equal to 0.3, the analysis frame is initially defined to be voiced speech. If the value of RATIO as defined in the previous section is also greater than the threshold value $0.7 \times 10^8$, then the analysis frame remains defined as voiced. If RATIO is less than $0.7 \times 10^8$, the voicing decision is switched to unvoiced.
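The two-stage voicing rule reduces to a pair of threshold tests. In this sketch `peak_ratio` stands for W(M)/W(0) and `ratio` for the RATIO term; the RATIO threshold is read here as 0.7 x 10^8, and treating a RATIO exactly equal to the threshold as voiced is an assumption, since the text covers only the strictly greater and strictly smaller cases.

```python
def is_voiced(peak_ratio, ratio, corr_threshold=0.3, ratio_threshold=0.7e8):
    """Primary test on the normalized correlation peak, with a secondary
    test on RATIO that can switch the decision to unvoiced."""
    initially_voiced = peak_ratio >= corr_threshold
    return initially_voiced and ratio >= ratio_threshold
```

A frame with a strong correlation peak but a small RATIO is therefore classified as unvoiced, which is the switching behavior described in the text.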

Speech  Synthesis 

A diagram showing the various parts of the synthesizer is shown in Figure 1.4. The decoded channel parameters, consisting of p reflection coefficients, pitch, voicing, and energy, are received and used for updating the synthesizer at the channel frame rate.


Parameter  Updating  and  Interpolation 
There are two sets of channel parameters available to the synthesizer at any one time, a left set and a right set. The sets correspond to successive left and right analyses separated by the multiple of the pitch period estimated in the left analysis. The synthesizer will generate the same number of pitch periods as the analysis was shifted down before performing the right analysis. To ensure that the first set of parameters used by the synthesizer is always synchronized properly with the left analysis set, the left set pitch period is repeated for each synthesis until a new channel parameter set is received. Stated another way, the synthesis is always advanced by the same amount as the analysis is advanced.

Even though this approach prevents the pitch from being interpolated, the reflection coefficients and energy are interpolated prior to each new pitch period interval.

In summary, the pitch is not interpolated, and the left set pitch period is repeated until a new channel parameter set arrives. The reflection coefficients and energy are linearly interpolated using the left and right channel sets as end points, with the first interpolated set equaling the left set. New interpolated values are used at the beginning of each pitch period. Since the analysis is advanced at the same rate, this method ensures that the first set of parameters used for synthesis corresponds (except for quantization error) to the left analysis set. Finally, after that multiple of pitch periods has been advanced such that a new set of channel parameters can be used, the right set becomes the left set and the channel set becomes the right set.

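The interpolation scheme can be sketched as a simple linear blend between the left and right parameter sets, with one interpolated set per synthesized pitch period. Spacing the sets evenly so that the first equals the left set and the right set itself is reached only at the next update is an assumption consistent with the description; the function name is illustrative.

```python
def interpolate_sets(left, right, num_periods):
    """Return num_periods parameter sets stepping linearly from `left`
    toward `right`; the first set equals `left`."""
    sets = []
    for j in range(num_periods):
        t = j / num_periods
        sets.append([l + t * (r - l) for l, r in zip(left, right)])
    return sets
```

With four pitch periods between updates, the blend weights are 0, 1/4, 1/2, and 3/4, after which the right set takes over as the new left set.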


Conversion  to  Predictor  Coefficients 
Synthesis is performed using the transversal filter configuration:

$$s_n = \sum_{i=1}^{p} a_i s_{n-i} + g\, e_n, \qquad n = 1, 2, \ldots, M \tag{1.16}$$

WHERE $s_n$ = synthesized speech

$e_n$ = 1 for n = 1 and 0 for n ≠ 1 (voiced); random numbers (unvoiced)

g = gain term

M = pitch period

Therefore, since reflection coefficients were transmitted, a set of predictor coefficients must be obtained. Using the standard mapping from reflection coefficients to predictor coefficients (Atal [1]), a set of predictor coefficients is obtained for each interpolated set of reflection coefficients.
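The mapping from reflection coefficients to predictor coefficients is commonly written as a step-up recursion, sketched here. The sign convention is an assumption (the stage-m reflection coefficient is taken equal to the last predictor coefficient of that stage, for a predictor of the form $s_n \approx \sum_i a_i s_{n-i}$); references differ on this sign, so the sketch should be matched to whichever analysis convention produced the coefficients.

```python
def reflection_to_predictor(k):
    """Step-up recursion: a_m^(m) = k_m and
    a_i^(m) = a_i^(m-1) - k_m * a_{m-i}^(m-1) for i = 1..m-1.

    Returns [a_1, ..., a_p] for p = len(k)."""
    a = []
    for m, km in enumerate(k, start=1):
        a = [a[i] - km * a[m - 2 - i] for i in range(m - 1)] + [km]
    return a
```

For two coefficients k = (0.5, 0.2) the recursion gives a = (0.4, 0.2): the first stage sets a_1 = 0.5, and the second stage corrects it to 0.5 - 0.2 x 0.5.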

Gain  Calculation  and  Speech  Synthesis 
A value  for  the  gain  term,  g is  estimated  to  insure  that  the 
energy  of  the  synthesized  speech  signal  equals  the  energy  calculated 
from  the  original  waveform.  The  method  used  is  that  proposed  by  Atal 
[1].  This  method  requires  considerably  more  computation  than  simply 
using  the  square  root  of  the  energy  of  the  error  signal.  However,  it 
has  been  determined  that  using  the  latter  method  can  cause  amplitude 
modulation  of  the  synthesized  waveform,  whereas,  if  gain  is  found  by 
matching  energies,  this  modulation  does  not  occur. 


Speech  is  synthesized  using  the  recursion  defined  in  equation  1.16. 
The  resulting  samples  are  then  stored  on  magnetic  tape  or  disk  for 
subsequent  conversion  to  an  audio  waveform  by  D to  A conversion.  A 
detailed  study  of  synthesis  using  fixed  point  arithmetic  has  been 
developed  by  Markel  and  Gray  [4]. 
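A minimal floating point sketch of the synthesis recursion in equation 1.16 is given here for one pitch period. The use of Gaussian random numbers for the unvoiced excitation, the zero-padded filter memory, and the function signature are assumptions for illustration; the gain g is taken as given rather than computed by the energy matching method.

```python
import random


def synthesize_period(a, g, M, history, voiced, rng=None):
    """Generate M samples of s_n = sum_i a_i s_{n-i} + g * e_n.

    `history` holds previously synthesized samples, most recent last."""
    p = len(a)
    rng = rng or random.Random(0)
    past = list(history[-p:])
    past = [0.0] * (p - len(past)) + past  # pad missing memory with zeros
    out = []
    for n in range(1, M + 1):
        if voiced:
            e = 1.0 if n == 1 else 0.0     # one unit impulse per pitch period
        else:
            e = rng.gauss(0.0, 1.0)        # assumed noise excitation
        sample = sum(a[i] * past[-1 - i] for i in range(p)) + g * e
        out.append(sample)
        past.append(sample)
    return out
```

Passing the tail of the previous period as `history` carries the filter memory across pitch period boundaries, so successive calls produce a continuous waveform.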

Channel  Parameter  Coding  and  Decoding 
This study did not consider new methods for optimally quantizing the channel parameters. Procedures were written to quantize the channel parameters based upon studies of Makhoul and Viswanathan [4] and Markel [2]. The reflection coefficients, $k_i$, were coded by linearly quantizing the log area functions $g_i$ derived from the reflection coefficients, where

$$g_i = \log \frac{1 + k_i}{1 - k_i}$$

The pitch and energy were logarithmically quantized.
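The coding of a reflection coefficient through its log area function can be sketched as follows. The quantizer range of plus or minus 4 and the 5-bit word length used in the usage lines are illustrative assumptions, not values taken from the report, and the function names are hypothetical.

```python
import math


def log_area_ratio(k):
    """g = log((1 + k) / (1 - k)) for |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))


def inverse_log_area_ratio(g):
    """Invert the mapping: k = tanh(g / 2)."""
    return math.tanh(g / 2.0)


def quantize_linear(value, lo, hi, bits):
    """Uniform quantizer returning an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize_linear(code, lo, hi, bits):
    """Map an integer code back to the corresponding quantizer level."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)


g = log_area_ratio(0.5)
code = quantize_linear(g, -4.0, 4.0, 5)
k_decoded = inverse_log_area_ratio(dequantize_linear(code, -4.0, 4.0, 5))
```

Because the inverse mapping is a hyperbolic tangent, any decoded value has magnitude less than one, so quantization in this domain cannot by itself produce an unstable coefficient.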
II. SMOOTHING REFLECTION COEFFICIENTS
USING AN A PRIORI LEAST SQUARES ESTIMATOR

Introduction 

This chapter discusses how the a priori least squares algorithm can be used to obtain a smoothed, minimum variance estimate of the reflection coefficients. As will be shown, smoothing results from the fact that each coefficient is filtered by a time-varying, first order, recursive low pass filter with filter coefficients defined by the least squares algorithm. Values for the coefficients are dynamically updated based upon the short-term characteristics of the speech waveform itself. As a result of this approach to smoothing, the filtering action is adaptive: heavy smoothing during stationary portions of the waveform, and light or negligible smoothing during nonstationary, transition portions. Secondly, the smoothing is efficient: since it is accomplished using a set of first order filters, the additional computation required does not become so excessive as to prevent real-time implementation.

The chapter has three main sections: the general development of the Minimum Variance A Priori least squares estimate, the simplification of the algorithm to scalar equations using the Gram-Schmidt orthogonalization, and the implementation and results of the algorithm.

The  Minimum  Variance  A Priori  Least  Squares  Estimate 
The problem of estimating successive sets of reflection
coefficients from successive sets of analysis frames can be viewed in
general terms as the estimation of one random process (the reflection
coefficients) from observations of a different but related process
(the speech waveform itself). For each analysis frame one extracts
a set of speech samples, $s_n$, and constructs a data vector $y_n$ and
a measurement matrix $H_n$. Using this information an estimate
$\hat{k}_n$ of the random process $k_n$ is calculated. The dynamic
model relating $y_n$ to $k_n$ is given by

$$ y_n = H_n B_n k_n + e_n \tag{II.1} $$

where $y_n$ is an $N \times 1$ vector of speech samples, and $H_n B_n$
is an $N \times p$ matrix constructed according to the linear
predictive coding model (see Ref. [7] for a discussion of $H_n$ and the
next section for a discussion of $B_n$). The vector $e_n$ represents
the modeling or prediction error. The vector $k_n$ represents the set
of p reflection coefficients to be estimated during the $n$th analysis
frame.

The minimum variance a priori least squares estimate $\hat{k}_n$ of
$k_n$ is found by minimizing the loss function $L_n$ of the $n$th
analysis frame

$$ L_n = (y_n - H_n B_n k_n)^T R_n^{-1} (y_n - H_n B_n k_n) + (k_n - \bar{k}_n)^T M_n^{-1} (k_n - \bar{k}_n) \tag{II.2} $$

where

$R_n = E\{e_n e_n^T\}$, the $N \times N$ positive definite covariance
matrix of $e_n$,

$M_n = E\{(k_n - \bar{k}_n)(k_n - \bar{k}_n)^T\}$, the $p \times p$
positive definite matrix of the a priori covariance of $k_n$.

The estimate $\hat{k}_n$ is found by minimizing $L_n$ with respect to
$k_n$ and is given by (see Ref. [7] for detailed development)

$$ \hat{k}_n = (B_n^T H_n^T R_n^{-1} H_n B_n + M_n^{-1})^{-1} (B_n^T H_n^T R_n^{-1} y_n + M_n^{-1} \bar{k}_n) \tag{II.3} $$

The covariance $P_n$ of $\hat{k}_n$ is given by

$$ P_n = \mathrm{Cov}(\hat{k}_n) = E\{(\hat{k}_n - k_n)(\hat{k}_n - k_n)^T\} = (B_n^T H_n^T R_n^{-1} H_n B_n + M_n^{-1})^{-1} \tag{II.4} $$

An equivalent expression for $\hat{k}_n$ is given by

$$ \hat{k}_n = \bar{k}_n + P_n B_n^T H_n^T R_n^{-1} (y_n - H_n B_n \bar{k}_n) \tag{II.5} $$
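The step between the two expressions for the estimate (II.3 and II.5)
is left to Ref. [7]. A short derivation is sketched here, writing
$G_n = H_n B_n$ as shorthand; the symbol $G_n$ is introduced only for
this sketch and is not used elsewhere in the report.

```latex
\begin{aligned}
P_n^{-1} &= G_n^T R_n^{-1} G_n + M_n^{-1}
  \quad\Longrightarrow\quad
  M_n^{-1} = P_n^{-1} - G_n^T R_n^{-1} G_n, \\
\hat{k}_n &= P_n \left( G_n^T R_n^{-1} y_n + M_n^{-1} \bar{k}_n \right) \\
  &= P_n G_n^T R_n^{-1} y_n
     + P_n \left( P_n^{-1} - G_n^T R_n^{-1} G_n \right) \bar{k}_n \\
  &= \bar{k}_n + P_n G_n^T R_n^{-1} \left( y_n - G_n \bar{k}_n \right).
\end{aligned}
```

The last line has the familiar predictor-corrector form: the a priori
estimate is corrected by a gain applied to the part of the data that
the a priori estimate fails to explain.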


The vector $\bar{k}_n$ represents the a priori estimate of $k_n$. As
such it represents the best estimate of $k_n$ prior to knowledge of
$y_n$. For this analysis, it is assumed that the reflection
coefficients $k_n$ obey the following relationship

$$ k_{n+1} = k_n + w_n $$

where

$$ E\{w_n\} = 0 $$

$$ E\{w_n w_j^T\} = Q_n \delta_{n,j} $$

Thus the expected value of $k_n$ before knowledge of $y_n$, denoted

$$ \bar{k}_n = E\{k_n \mid y_1, y_2, \ldots, y_{n-1}\} $$

is given by

$$ \bar{k}_n = \hat{k}_{n-1} \tag{II.10} $$

with covariance $M_n$ given by

$$ M_n = E\{(k_n - \bar{k}_n)(k_n - \bar{k}_n)^T\} = E\{(k_{n-1} + w_n - \hat{k}_{n-1})(k_{n-1} + w_n - \hat{k}_{n-1})^T\} = P_{n-1} + Q_n \tag{II.11} $$

Thus the expected variance of $\bar{k}_n$ equals the variance of
$\hat{k}_{n-1}$ plus the variance of the difference between $k_n$ and
$k_{n-1}$.

Note that this definition of $M_n$ allows the estimate of $k_n$ to
adjust dynamically to the changes in the speech waveform itself. For
stationary sections of speech where $k_n \approx k_{n-1}$,
$Q_n \approx 0$ and $M_n \approx P_{n-1}$, whereas for transition
regions where $k_n$ differs appreciably from $k_{n-1}$, the variance on
$\bar{k}_n$ will equal the variance of $\hat{k}_{n-1}$ (the best
estimate prior to measurement) plus the variance of the change from
$k_{n-1}$ to $k_n$. Thus the degree to which one believes $\bar{k}_n$
as an estimate of $k_n$ is directly affected by how things have changed
from the previous frame, namely the covariance $Q_n$.

The expressions given in equations II.3 through II.11 are matrix
equations and as such their implementation would be computationally
prohibitive without further reduction. The next section describes how
these equations can be reduced to p sets of scalar equations.

The a priori estimator equations, II.3 through II.5, can be
reduced to p sets of scalar equations by taking advantage of two facts.
First, as Mitsui [8] has pointed out, the reflection coefficients
represent the weights of an orthogonal basis set, which add up to form
the predicted speech wave, and so they are independent of each other.
Thus the covariance matrices $M_n$ and $P_n$ are diagonal. Second, by
using the Cholesky decomposition, the Gram matrix $H^T H$ can be
diagonalized. Specifically, we have from the Cholesky decomposition
that

$$ H^T H = L D U \tag{I.5} $$

and

$$ B = U^{-1} $$

Thus the a priori estimate, equation II.3, can be diagonalized as
follows: given

$$ \hat{k}_n = (B_n^T H_n^T R_n^{-1} H_n B_n + M_n^{-1})^{-1} (B_n^T H_n^T R_n^{-1} y_n + M_n^{-1} \bar{k}_n) \tag{II.3} $$


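let

$$ R_n = r_n I $$

$$ M_n = \mathrm{diag}[m_n^{(i)}] $$

then, since $L_n = U_n^T$ for the symmetric matrix $H_n^T H_n$,

$$ B_n^T H_n^T R_n^{-1} H_n B_n = \frac{1}{r_n} (U_n^{-1})^T L_n D_n U_n U_n^{-1} = \frac{1}{r_n} D_n \tag{II.12} $$

where

$$ D_n = \mathrm{diag}[d_n^{(i)}] $$

thus

$$ \hat{k}_n = \left( \frac{1}{r_n} D_n + M_n^{-1} \right)^{-1} \left( \frac{1}{r_n} B_n^T H_n^T y_n + M_n^{-1} \bar{k}_n \right) \tag{II.13} $$

or

$$ \hat{k}_n = \left( \frac{1}{r_n} I + D_n^{-1} M_n^{-1} \right)^{-1} \left( \frac{1}{r_n} D_n^{-1} B_n^T H_n^T y_n + D_n^{-1} M_n^{-1} \bar{k}_n \right) \tag{II.14} $$

but

$$ D_n^{-1} B_n^T H_n^T y_n = (B_n^T H_n^T H_n B_n)^{-1} B_n^T H_n^T y_n = k_{ML}, \text{ the classical least squares estimate} \tag{II.15} $$

thus

$$ \hat{k}_n = \left[ \frac{1}{r_n} I + (M_n D_n)^{-1} \right]^{-1} \left[ \frac{1}{r_n} k_{ML} + (M_n D_n)^{-1} \bar{k}_n \right] \tag{II.16} $$

The matrix product $M_n D_n$ will be diagonal, equaling

$$ M_n D_n = \mathrm{diag}[m_n^{(i)} d_n^{(i)}] \tag{II.17} $$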
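Thus equation II.16, and therefore equation II.3, reduces to p sets
of scalar equations given by

$$ \hat{k}_n^{(i)} = \frac{m_n^{(i)} d_n^{(i)} / r_n}{1 + m_n^{(i)} d_n^{(i)} / r_n} \, k_{ML}^{(i)} + \frac{1}{1 + m_n^{(i)} d_n^{(i)} / r_n} \, \bar{k}_n^{(i)}, \quad i = 1, 2, \ldots, p \tag{II.18} $$

In addition

$$ \bar{k}_n^{(i)} = \hat{k}_{n-1}^{(i)}, \quad i = 1, 2, \ldots, p \tag{II.10} $$

Thus the a priori estimator reduces to p sets of first order recursive
filters on the reflection coefficients,

$$ \hat{k}_n^{(i)} = \frac{\lambda_n^{(i)}}{1 + \lambda_n^{(i)}} \, k_{ML}^{(i)} + \frac{1}{1 + \lambda_n^{(i)}} \, \hat{k}_{n-1}^{(i)} \tag{II.19} $$

where

$$ \lambda_n^{(i)} = \frac{m_n^{(i)} d_n^{(i)}}{r_n}, \quad i = 1, 2, \ldots, p $$

Starting with equation II.5 an equivalent expression for $\hat{k}_n$
can be derived as

$$ \hat{k}_n^{(i)} = \hat{k}_{n-1}^{(i)} + \frac{\lambda_n^{(i)}}{1 + \lambda_n^{(i)}} \left( k_{ML}^{(i)} - \hat{k}_{n-1}^{(i)} \right), \quad i = 1, 2, \ldots, p \tag{II.20} $$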


n 


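The covariance $P_n$ of $\hat{k}_n$ can likewise be diagonalized as
follows:

$$ P_n = (B_n^T H_n^T R_n^{-1} H_n B_n + M_n^{-1})^{-1} = \left( \frac{1}{r_n} D_n + M_n^{-1} \right)^{-1} \tag{II.21} $$

or for each coefficient

$$ p_n^{(i)} = E\{(\hat{k}_n^{(i)} - k_n^{(i)})^2\} = \frac{m_n^{(i)}}{1 + m_n^{(i)} d_n^{(i)} / r_n}, \quad i = 1, 2, \ldots, p \tag{II.22} $$

Thus equation II.20 can be expressed in terms of $p_n^{(i)}$ as

$$ \hat{k}_n^{(i)} = \hat{k}_{n-1}^{(i)} + \frac{p_n^{(i)} d_n^{(i)}}{r_n} \left( k_{ML}^{(i)} - \hat{k}_{n-1}^{(i)} \right) \tag{II.23} $$

Using equation II.11, each diagonal element of the covariance $M_n$ on
$\bar{k}_n$ is given by

$$ m_n^{(i)} = E\{(k_n^{(i)} - \bar{k}_n^{(i)})^2\} = E\{(k_{n-1}^{(i)} - \hat{k}_{n-1}^{(i)})^2\} + E\{(w_n^{(i)})^2\} \tag{II.24} $$

or

$$ m_n^{(i)} = p_{n-1}^{(i)} + q_n^{(i)}, \quad i = 1, 2, \ldots, p \tag{II.25} $$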


Implementation  and  Results 

The a priori least squares estimate is obtained from the following
five step process.

Step 1  Compute the maximum likelihood estimate, $k_{ML}^{(i)}$, and
        the diagonal elements, $d_n^{(i)}$, of $D_n$.

Step 2  Estimate the elements of the covariance matrix $M_n$,

        $$ m_n^{(i)} = p_{n-1}^{(i)} + q_n^{(i)} $$

Step 3  Estimate the covariance on the modeling error, $r_n$.

Step 4  Compute the covariance $p_n^{(i)}$ of $\hat{k}_n^{(i)}$,

        $$ p_n^{(i)} = \frac{m_n^{(i)}}{1 + m_n^{(i)} d_n^{(i)} / r_n} $$

Step 5  Compute the final estimate, $\hat{k}_n^{(i)}$,

        $$ \hat{k}_n^{(i)} = \hat{k}_{n-1}^{(i)} + \frac{p_n^{(i)} d_n^{(i)}}{r_n} \left( k_{ML}^{(i)} - \hat{k}_{n-1}^{(i)} \right) $$

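The covariance $r_n$ of the modeling error is approximated by
computing the residual

$$ \alpha_n = (y_n - H_n B_n k_{ML})^T (y_n - H_n B_n k_{ML}) \tag{II.26} $$

or

$$ \alpha_n = \phi_{0,0} + \sum_{i=1}^{p} \phi_{i,0} \, a_i \quad \text{or} \quad r_0 + \sum_{i=1}^{p} r_i \, a_i \tag{II.27} $$

and setting

$$ r_n = \frac{\alpha_n}{N} \tag{II.28} $$

where

N = Coefficient analysis window size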


The covariance $q_n^{(i)}$ defining the expected variance on the
difference in reflection coefficients is approximated as

$$ q_n^{(i)} = E\{(k_n^{(i)} - k_{n-1}^{(i)})^2\} = (k_{ML|n}^{(i)} - \ldots)^2 \tag{II.29} $$

The initial values for starting the algorithm are defined to be

$$ \hat{k}_0^{(i)} = 0 $$
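The recursive part of the five step process (Steps 2, 4 and 5) can be
sketched for a single reflection coefficient as follows. This is a
minimal illustration, not the report's implementation: the function
name `smooth_track`, the starting variance `p0`, the toy data, and the
choice of $q_n$ as the squared change between successive unsmoothed
estimates are all assumptions made for the example. Steps 1 and 3 are
taken as given inputs.

```python
from typing import List, Sequence


def smooth_track(k_ml: Sequence[float], d: Sequence[float],
                 r: Sequence[float], q: Sequence[float],
                 k0: float = 0.0, p0: float = 1.0) -> List[float]:
    """Run Steps 2, 4 and 5 for one reflection coefficient over all frames.

    k_ml, d, r are the Step 1 and Step 3 outputs per frame; q is the
    per-frame variance of the frame-to-frame change. p0 is an assumed
    starting variance (the text only specifies the starting estimate).
    """
    k_hat, p = k0, p0
    out = []
    for k_ml_n, d_n, r_n, q_n in zip(k_ml, d, r, q):
        m = p + q_n                              # Step 2, eq. II.25
        p = m / (1.0 + m * d_n / r_n)            # Step 4, eq. II.22
        gain = p * d_n / r_n                     # equals lambda / (1 + lambda)
        k_hat = k_hat + gain * (k_ml_n - k_hat)  # Step 5, eq. II.23
        out.append(k_hat)
    return out


# Toy data: a noisy but stationary coefficient followed by an abrupt change.
k_ml = [0.50, 0.54, 0.46, 0.52, 0.48, -0.30, -0.28, -0.32]
d = [1.0] * len(k_ml)     # assumed constant Gram-matrix diagonal
r = [0.05] * len(k_ml)    # assumed constant modeling-error variance
# Assumed q_n: squared change between successive unsmoothed estimates.
q = [0.0] + [(b - a) ** 2 for a, b in zip(k_ml, k_ml[1:])]
smoothed = smooth_track(k_ml, d, r, q, k0=k_ml[0], p0=0.01)
```

With these toy numbers the gain stays small while the coefficient is
stationary, so the jitter is averaged out, and rises close to one at
the abrupt change, so the smoothed track follows the new value almost
immediately.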


Pitch  Synchronous  Analysis 


To insure that the a priori estimate $\bar{k}_n$ of the reflection
coefficients represents a reasonable estimate of $k_n$ prior to using
the data vector $y_n$, the coefficient analysis is done pitch
synchronously, rather than at the slower rate defined by the channel
frame rate. By shortening the distance between successive updates, the
variance $M_n$ of $(k_n - \bar{k}_n)$ decreases during stationary
speech segments, thus increasing the amount of smoothing.


This analysis approach is similar to that used in providing the
down-sampled speech to the pitch detection system. That is, the
coefficients are estimated at the high, pitch synchronous rate,
smoothed, then down-sampled for channel transmission. The primary
difference between the two down-sampling methods is that the smoothing
filter applied to the coefficients must be time-varying. During
stationary segments of speech, the filter has a narrow pass band with
its pole near the unit circle, while during transition regions, the
filter essentially locks onto the input, namely $k_{ML}$, with its pole
near the origin.
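The pole locations quoted above can be read off equation II.20 if the
gain is treated as constant over a short run of frames. That is an
approximation made here only for illustration, since the filter is
time-varying. Writing $g$ for the gain (a symbol introduced for this
sketch):

```latex
\begin{aligned}
\hat{k}_n &= (1 - g)\,\hat{k}_{n-1} + g\,k_{ML,n},
  \qquad g = \frac{\lambda}{1 + \lambda}, \\
\frac{\hat{K}(z)}{K_{ML}(z)} &= \frac{g}{1 - (1 - g)\,z^{-1}},
  \qquad \text{pole at } z = 1 - g = \frac{1}{1 + \lambda}.
\end{aligned}
```

When $\lambda$ is small (stationary speech, where $q_n$ and hence
$m_n$ are small) the pole approaches $z = 1$ and the pass band is
narrow. When $\lambda$ is large (transitions, where $q_n$ is large)
the pole approaches the origin and the output tracks $k_{ML}$. The
filter has unity gain at $z = 1$ in both cases.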




Results 


The results of using smoothing are best demonstrated by listening
to processed speech using smoothed reflection coefficients. Informal
listening tests indicate that smoothing improves speech quality in terms
of (1) reducing the number of instabilities when using the covariance
method, and thereby reducing the number of annoying non-speech-like pops
resulting from hard limiting the coefficient values back to 0.97;
(2) reducing the roughness on sustained vowel regions induced by a slow
update rate; (3) eliminating the warbling induced by step
discontinuities in the spectrum; and (4) lessening the degradation of
speech quality as the analysis window size is narrowed.

Figures  II. 1 and  II. 2 show  the  result  of  the  adaptive  smoothing 
applied  to  the  k j and  k^  reflection  coefficients.  The  successive 
phonemes /eI/, /i/, /aI/, /oU/, /u/ were digitized at 6.4 kHz and
processed using a tenth order filter using the covariance method. The
time histories of both coefficients with and without smoothing are
displayed, with the darker curve corresponding to the smoothed estimate.

SUMMARY AND CONCLUSIONS


As has been shown, when the reflection coefficients are estimated
using the a priori algorithm, smoothing results from the fact that
each coefficient is filtered by a time-varying, first order, recursive
low-pass filter. Values for the filter coefficients are dynamically
updated based upon the short-term characteristics of the speech
waveform itself. There are two major advantages to smoothing in this
manner: (1) the filtering is adaptive, with heavy smoothing during
stationary portions of the waveform, and light or negligible smoothing
during non-stationary, transition portions; and (2) the smoothing
algorithm is efficient, requiring only first order filters, and
therefore can possibly be implemented in real time.


III.  STREAK:  A Simplified  Technique  for  Recursively 
Estimating  Autocorrelation  k-Parameters 

Introduction 

A Simplified  Technique  for  Recursively  Estimating  Autocorrelation 
k-parameters  (hereafter  called  STREAK),  defines  a method  for  calculating 
the  k-parameters  associated  with  the  lattice  form  of  the  inverse  filter 
model  used  in  the  linear  prediction  analysis.  This  method  differs  from 
standard  LPC  approaches  in  two  major  respects:  one,  the  k-parameters 
are  estimated  directly  from  the  lattice  model;  and  two,  new  estimates 
are calculated for each A-D sample. This chapter describes how
these  coefficients  are  calculated,  how  they  may  be  used  in  an  analysis- 
synthesis  system,  and  how  this  technique  could  be  used  to  improve  the 
quality  of  a pitch  extraction  routine  based  upon  the  autocorrelation  of 
the  inverse  filter  output  sequence. 

The  standard  approach  in  linear  prediction  is  to  estimate  one  set 
of  M coefficients  from  a block  of  N data  points,  [1],  [2],  [3].  Values 
for  these  coefficients  are  calculated  so  as  to  minimize  the  sum  of  the 
squares  of  the  prediction  error  sequence.  As  such,  the  least  squares 
curve  fit  is  applied  uniformly  over  the  entire  block  of  N samples. 

This paper introduces a new concept in inverse filtering. Rather than
estimating one set of parameters for a window of N samples, a new least
squares estimate of each parameter is calculated at each point. The
analysis is based upon the lattice form [2] of the inverse filter.
Values for the k-parameters are obtained directly in terms of the
forward and backward prediction error sequences.

The method is called a Simplified Technique for Recursively
Estimating Autocorrelation k-parameters, or STREAK, since: (1) only
scalar equations are involved in the analysis, thus reducing the
complexity of implementation; (2) successive k-parameter estimates are
recursively estimated from preceding values; and (3) like the standard
autocorrelation approach, these k-parameter estimates are bounded in
magnitude by one.

The  chapter  is  divided  into  two  parts:  a development  of  estimating 
k-parameters  directly  from  the  lattice  form  of  the  inverse  filter,  and  a 
comparison  of  the  inverse  filter  output  using  STREAK  versus  the  output 
using  the  standard  autocorrelation  method. 

The Lattice Formulation for Inverse Filtering

Itakura and Saito [10] developed a formulation for linear prediction
analysis using a lattice form for the inverse filter. A block diagram
of their PARCOR analyzer is shown in Figure III.1.


Figure III.1  Lattice Form Inverse Filter


Both forward $e_m^+(n)$ and backward $e_m^-(n)$ prediction error
sequences for this filter are defined as

$$ e_m^+(n) = \sum_{i=0}^{m} a_{m,i} \, s(n-i) \tag{III.1} $$

$$ e_m^-(n) = \sum_{i=1}^{m+1} b_{m,i} \, s(n-i) \tag{III.2} $$

The z-transforms for these prediction error sequences are defined as

$$ E_m^+(z) = A_m(z) \, S(z) \tag{III.3} $$

$$ E_m^-(z) = B_m(z) \, S(z) \tag{III.4} $$

where

$$ A_m(z) = \sum_{i=0}^{m} a_{m,i} \, z^{-i} \tag{III.5} $$

$$ B_m(z) = \sum_{i=1}^{m+1} b_{m,i} \, z^{-i} \tag{III.6} $$

and S(z) is the z transform of the input signal, s(n).

It can be shown [19] that these filter polynomials, $A_m(z)$ and
$B_m(z)$, satisfy the following recursive relation

$$ A_{m+1}(z) = A_m(z) - k_m B_m(z) \tag{III.7} $$

$$ z \, B_{m+1}(z) = B_m(z) - k_m A_m(z) \tag{III.8} $$

where $k_m$ is the $m$th k parameter.

Using equations III.7 and III.8 the forward and backward prediction
error sequences satisfy

$$ e_{m+1}^+(n) = e_m^+(n) - k_m \, e_m^-(n), \quad e_0^+(n) = s(n) \tag{III.9} $$

$$ e_{m+1}^-(n+1) = e_m^-(n) - k_m \, e_m^+(n), \quad e_0^-(n) = s(n-1) \tag{III.10} $$

The analysis procedure consists of estimating $k_m$ based on $e_m^+(n)$
and $e_m^-(n)$, then advancing to the next stage of the filter using
equations III.9 and III.10.

From the analysis, estimates of the M reflection coefficients
$k_m$, m = 0, 1, ..., M-1, and the final prediction errors $e_M^+(n)$
and $e_M^-(n)$ are obtained.

Estimating $k_m$ (Block Analysis Method)

The standard procedure for estimating $k_m$ was to assume that the
input waveform s(n) was stationary over an interval of N samples and
estimate $k_m$ based on the short-term autocorrelation of $e_m^+(n)$
with $e_m^-(n)$ [2], [19]. Using this approach each $k_m$ was
calculated as

$$ k_m = \frac{\sum_{n=1}^{N} e_m^+(n) \, e_m^-(n)}{\left[ \sum_{n=1}^{N} (e_m^+(n))^2 \sum_{n=1}^{N} (e_m^-(n))^2 \right]^{1/2}} $$

For comparison purposes this approach will be defined as the block
analysis method since one set of k-parameters is estimated for a block
of N sample points.
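The block analysis method can be sketched as below, stage by stage:
one $k_m$ is computed from the whole block, then the block is passed
through that stage using III.9 and III.10. The function name
`block_parcor`, the zero values assumed before the start of the block,
and the test signal are illustrative choices, not details taken from
the report (no window is applied here).

```python
import math
from typing import List, Sequence, Tuple


def block_parcor(s: Sequence[float], order: int) -> Tuple[List[float], List[float]]:
    """Estimate one k_m per stage from a whole block, then filter the block.

    Returns (ks, residual) where residual[n] is the final forward error.
    """
    fwd = list(s)                   # e_0^+(n) = s(n)
    bwd = [0.0] + list(s[:-1])      # e_0^-(n) = s(n-1), zero before the block
    ks = []
    for _ in range(order):
        num = sum(f * b for f, b in zip(fwd, bwd))
        den = math.sqrt(sum(f * f for f in fwd) * sum(b * b for b in bwd))
        k = num / den if den > 0.0 else 0.0
        ks.append(k)
        new_fwd = [f - k * b for f, b in zip(fwd, bwd)]             # III.9
        # III.10 yields e_{m+1}^-(n+1); delay by one sample to index it at n.
        new_bwd = [0.0] + [b - k * f for f, b in zip(fwd, bwd)][:-1]
        fwd, bwd = new_fwd, new_bwd
    return ks, fwd


# Impulse response of a one-pole filter with its pole at 0.9.
s = [0.9 ** n for n in range(50)]
ks, residual = block_parcor(s, order=2)
```

For this one-pole signal the first coefficient comes out very close to
0.9 and the residual is close to a single impulse at n = 0.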

The next section develops a method for estimating new values for
the k-parameters at each sample, n.


Estimating $k_m$ (Single Sample Analysis Method, STREAK)

An estimate of each k-parameter at each sample point can be
obtained by calculating that value of $k_m$ such that the sum
of the squares of $e_{m+1}^+(n)$ and $e_{m+1}^-(n+1)$ is minimized for
each n. That is, referring to Figure III.1, a logical criterion for
estimation is that the energies at the next stage of the filter should
be minimized. Thus, a value for $k_m$ is calculated to minimize
$[e_{m+1}^+(n)]^2 + [e_{m+1}^-(n+1)]^2$.

Therefore, define the loss function $L_m$ at the $m$th stage of the
filter as

$$ L_m = [e_{m+1}^+(n)]^2 + [e_{m+1}^-(n+1)]^2 \tag{III.12} $$

$L_m$ is to be minimized with respect to $k_m$. Substituting the
expressions for $e_{m+1}^+(n)$ and $e_{m+1}^-(n+1)$ from equations
III.9 and III.10 gives

$$ L_m = [e_m^+(n) - k_m e_m^-(n)]^2 + [e_m^-(n) - k_m e_m^+(n)]^2 \tag{III.13} $$

$L_m$ is minimized by equating to zero the derivative of $L_m$ with
respect to $k_m$ and solving for $k_m$. Thus

$$ \frac{\partial L_m}{\partial k_m} = 0 = -2 [e_m^+(n) - k_m e_m^-(n)] \, e_m^-(n) - 2 [e_m^-(n) - k_m e_m^+(n)] \, e_m^+(n) \tag{III.14} $$

$$ k_m(n) = \frac{2 \, e_m^+(n) \, e_m^-(n)}{[e_m^+(n)]^2 + [e_m^-(n)]^2}, \quad m = 0, 1, \ldots, M-1 \tag{III.15} $$


Equation III.15 along with the updating equations III.9 and III.10
define the complete analysis procedure.

As each new sample, s(n), enters the inverse filter, new values for
each k-parameter are estimated. Thus, from $e_0^+(n)$ and $e_0^-(n)$,
$k_0(n)$ is formed using equation III.15. Next $e_1^+(n)$ and
$e_1^-(n+1)$ are computed using equations III.9 and III.10. The
analysis then advances to the next section of the filter and
everything repeats.

Examining the analysis equations shows that the total number of
arithmetic operations for each sample consists of five multiplies,
three adds, and one divide per section, times M sections. M+1 storage
locations are required for the $e_m^+(n)$ and $e_m^-(n)$ arrays and M
locations for the $k_m(n)$ array.
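The per-sample procedure just described can be sketched as follows.
This is an illustrative reading of equations III.9, III.10 and III.15,
not the report's own code: the function name `streak_analyze`, the
zero initial filter state, and the choice of $k_m(n) = 0$ when both
errors are zero are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def streak_analyze(s: Sequence[float], order: int) -> Tuple[List[List[float]], List[float]]:
    """Return (ks, residual): ks[n][m] = k_m(n), residual[n] = e_M^+(n)."""
    bwd = [0.0] * (order + 1)   # bwd[m] holds e_m^-(n); zero start (assumed)
    prev_sample = 0.0           # s(n-1) before the first sample (assumed zero)
    ks, residual = [], []
    for x in s:
        fwd = x                 # e_0^+(n) = s(n)
        bwd[0] = prev_sample    # e_0^-(n) = s(n-1)
        next_bwd = bwd[:]       # will hold e_m^-(n+1) for the next sample
        k_row = []
        for m in range(order):
            b = bwd[m]
            energy = fwd * fwd + b * b
            # III.15; k is set to 0 when both errors vanish (assumption).
            k = 2.0 * fwd * b / energy if energy > 0.0 else 0.0
            k_row.append(k)
            next_bwd[m + 1] = b - k * fwd   # III.10: e_{m+1}^-(n+1)
            fwd = fwd - k * b               # III.9:  e_{m+1}^+(n)
        ks.append(k_row)
        residual.append(fwd)
        bwd = next_bwd
        prev_sample = x
    return ks, residual


# Two-sinusoid test signal (arbitrary illustrative input).
s = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(200)]
ks, residual = streak_analyze(s, order=4)
```

Each section costs the five multiplies, three adds and one divide
counted above if the factor of two is absorbed elsewhere; the sketch
keeps it explicit for readability.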


The Relation Between the k-parameters and the Forward and Backward
Prediction Error


From equations III.9, III.10 and III.15 a recursion relating the
k-parameter to the (m+1)st forward and backward prediction error can
be obtained which is identical to that using the block analysis method.
Thus substituting the expression for $k_m(n)$ given in equation III.15
into the forward and backward prediction error sequences, equations
III.9 and III.10, gives

$$ e_{m+1}^+(n) = e_m^+(n) - \frac{2 \, e_m^+(n) \, (e_m^-(n))^2}{(e_m^+(n))^2 + (e_m^-(n))^2} \tag{III.16} $$

$$ e_{m+1}^-(n+1) = e_m^-(n) - \frac{2 \, (e_m^+(n))^2 \, e_m^-(n)}{(e_m^+(n))^2 + (e_m^-(n))^2} \tag{III.17} $$

Simplifying gives

$$ e_{m+1}^+(n) = e_m^+(n) \, \frac{(e_m^+(n))^2 - (e_m^-(n))^2}{(e_m^+(n))^2 + (e_m^-(n))^2} \tag{III.18} $$

$$ e_{m+1}^-(n+1) = -e_m^-(n) \, \frac{(e_m^+(n))^2 - (e_m^-(n))^2}{(e_m^+(n))^2 + (e_m^-(n))^2} \tag{III.19} $$

but

$$ 1 - k_m^2(n) = 1 - \frac{4 \, (e_m^+(n))^2 \, (e_m^-(n))^2}{\left[ (e_m^+(n))^2 + (e_m^-(n))^2 \right]^2} \tag{III.20} $$

$$ = \left[ \frac{(e_m^+(n))^2 - (e_m^-(n))^2}{(e_m^+(n))^2 + (e_m^-(n))^2} \right]^2 \tag{III.21} $$

thus

$$ e_{m+1}^+(n) = e_m^+(n) \, (1 - k_m^2(n))^{1/2} \tag{III.22} $$

$$ e_{m+1}^-(n+1) = -e_m^-(n) \, (1 - k_m^2(n))^{1/2} \tag{III.23} $$

Strictly, III.22 and III.23 hold as written when
$|e_m^+(n)| \ge |e_m^-(n)|$; otherwise the sign of the square root is
reversed. In either case the magnitudes agree.

Equation III.22 shows that the energy of the forward prediction
error, $(e_m^+(n))^2$, will be a monotonically non-increasing function
of m.

Note also that if

$$ |e_m^+(n)| = |e_m^-(n)| $$

then

$$ |k_m(n)| = 1 $$

and

$$ e_{m+1}^+(n) = e_{m+1}^-(n+1) = 0 $$

For this situation, $e_m^+(n)$ and $e_m^-(n)$ are predicted exactly
from $e_m^-(n)$ and $e_m^+(n)$ respectively and thus the prediction
errors $e_{m+1}^+(n)$ and $e_{m+1}^-(n+1)$ must be zero.

Likewise if

$$ k_m(n) = 0 $$

then

$$ e_{m+1}^+(n) = e_m^+(n) $$

$$ e_{m+1}^-(n+1) = e_m^-(n) $$


Speech  Synthesis  Using  STREAK 

The original waveform $s_n = e_0^+(n)$ can be reconstructed from the
k-parameters $k_m(n)$, m = 0, 1, ..., M-1, and the final forward
prediction error, $e_M^+(n)$. The synthesis equations are determined by
expressing $e_m^+(n)$ in equation III.9 in terms of $e_{m+1}^+(n)$,
$e_m^-(n)$ and $k_m(n)$. The resulting synthesis filter (commonly
called the two multiply lattice filter [20]) is defined

$$ e_m^+(n) = e_{m+1}^+(n) + k_m(n) \, e_m^-(n) \tag{III.24} $$

$$ e_{m+1}^-(n+1) = e_m^-(n) - k_m(n) \, e_m^+(n) \tag{III.25} $$

m = M-1, ..., 0,

with $e_M^+(n)$ given and $e_0^-(n) = s_{n-1}$.

Note that since STREAK calculates new values for k-parameters at
each sample point, n, the synthesis filter can also be updated for each
new sample. Thus the filter characteristics can vary at the sampling
rate. This approach represents a somewhat radical departure from the
standard procedure of supplying the synthesizer with just one set per
pitch period or per analysis frame. To do this of course demands that
new k-parameters be transmitted at the sampling rate, thus requiring an
enormous channel bandwidth. However, preliminary experiments show that
a pitch excited synthesizer updated continuously with k-parameters
estimated at the sampling rate generates synthetic speech having a
more natural quality than that obtained using standard block
analysis-synthesis methods such as described in Section I. Currently
methods are being investigated for reducing the bandwidth but
retaining this quality.
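The reconstruction property of the synthesis equations can be checked
with a small round trip: analyze a signal with III.9, III.10 and
III.15, then rebuild it with III.24 and III.25 from the per-sample
k-parameters and the final forward error. The sketch below is
illustrative only. The function names, the zero initial state, and the
test signal are assumptions, and the exact residual is used as the
excitation rather than a pitch pulse source.

```python
import math
from typing import List, Sequence, Tuple


def streak_analyze(s: Sequence[float], order: int) -> Tuple[List[List[float]], List[float]]:
    """Per-sample lattice analysis (III.9, III.10, III.15), zero initial state."""
    bwd = [0.0] * (order + 1)              # bwd[m] = e_m^-(n)
    ks, residual = [], []
    for x in s:
        fwd, next_bwd, row = x, bwd[:], []
        for m in range(order):
            b = bwd[m]
            energy = fwd * fwd + b * b
            k = 2.0 * fwd * b / energy if energy > 0.0 else 0.0
            row.append(k)
            next_bwd[m + 1] = b - k * fwd  # e_{m+1}^-(n+1)
            fwd -= k * b                   # e_{m+1}^+(n)
        next_bwd[0] = x                    # e_0^-(n+1) = s(n)
        ks.append(row)
        residual.append(fwd)
        bwd = next_bwd
    return ks, residual


def streak_synthesize(ks: Sequence[Sequence[float]], residual: Sequence[float],
                      order: int) -> List[float]:
    """Two multiply lattice synthesis (III.24, III.25), zero initial state."""
    bwd = [0.0] * (order + 1)              # bwd[m] = e_m^-(n)
    out = []
    for k_row, e in zip(ks, residual):
        fwd = e                            # e_M^+(n)
        next_bwd = bwd[:]
        for m in range(order - 1, -1, -1):
            fwd = fwd + k_row[m] * bwd[m]               # III.24: e_m^+(n)
            next_bwd[m + 1] = bwd[m] - k_row[m] * fwd   # III.25
        out.append(fwd)                    # e_0^+(n) = s(n)
        next_bwd[0] = fwd                  # e_0^-(n+1) = s(n)
        bwd = next_bwd
    return out


s = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(64)]
ks, residual = streak_analyze(s, order=4)
rebuilt = streak_synthesize(ks, residual, order=4)
```

The rebuilt signal matches the input to within floating point rounding,
which is the sense in which the k-parameters and the final forward
error together carry the whole waveform.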


A Geometrical Interpretation of $k_m$

From the fact that

$$ a^2 + b^2 \ge 2|ab| \tag{III.26} $$

it can be seen that

$$ |k_m(n)| \le 1 \tag{III.27} $$

Geometrically, by defining two lines $\ell_1$ and $\ell_2$ extending
from the origin having coordinates $[e_m^+(n), e_m^-(n)]$ and
$[e_m^-(n), e_m^+(n)]$ respectively, it is shown that $k_m$ equals the
cosine of the angle $\theta$ between $\ell_1$ and $\ell_2$ (see Figure
III.2). Thus, by definition

$$ \cos\theta = \frac{\ell_1 \cdot \ell_2}{|\ell_1| \, |\ell_2|} \tag{III.28} $$

$$ \ell_1 \cdot \ell_2 = 2 \, e_m^+(n) \, e_m^-(n) \tag{III.29} $$

$$ |\ell_1| = |\ell_2| = \left\{ [e_m^+(n)]^2 + [e_m^-(n)]^2 \right\}^{1/2} \tag{III.30} $$

Figure III.2  Geometrical Interpretation of $k_m$

Thus, $k_m = \cos\theta$.

It should also be noted that $\ell_1$ minus the projection of $\ell_1$
onto $\ell_2$ defines a vector having coordinates
$[e_{m+1}^+(n), e_{m+1}^-(n+1)]$. It is, of course, the length of this
vector which is minimized by defining $k_m$ as $\cos\theta$.
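A small numerical check of this geometric picture is sketched below for
one pair of error values. The function name `k_and_residual` and the
sample values 3.0 and 1.0 are arbitrary choices made for illustration.

```python
import math
from typing import Tuple


def k_and_residual(e_plus: float, e_minus: float) -> Tuple[float, Tuple[float, float]]:
    """Return cos(theta) between l1 and l2, and l1 minus its projection on l2."""
    l1 = (e_plus, e_minus)
    l2 = (e_minus, e_plus)
    dot = l1[0] * l2[0] + l1[1] * l2[1]           # III.29
    norm_sq = e_plus ** 2 + e_minus ** 2          # |l1|^2 = |l2|^2, III.30
    cos_theta = dot / norm_sq                     # III.28
    proj = (cos_theta * l2[0], cos_theta * l2[1])  # projection of l1 onto l2
    resid = (l1[0] - proj[0], l1[1] - proj[1])
    return cos_theta, resid


cos_theta, resid = k_and_residual(3.0, 1.0)

# Next-stage errors computed directly from III.15, III.9 and III.10.
k = 2.0 * 3.0 * 1.0 / (3.0 ** 2 + 1.0 ** 2)
next_fwd = 3.0 - k * 1.0
next_bwd = 1.0 - k * 3.0
```

Here the cosine and the k-parameter coincide, the residual vector
equals the pair of next-stage errors, and the residual is orthogonal to
$\ell_2$, as a least squares projection should be.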

Using  STREAK  for  Pitch  Detection 

A well-established technique for pitch detection consists of
performing an autocorrelation on the error signal obtained from the
inverse filter output [7], [12], [18]. The idea behind this method is
that if the all-pole model accurately represents the vocal tract
transfer function and the radiation and glottal volume flow effects,
then the output of the inverse filter should resemble an impulse-like
driving function having a period equal to the pitch for voiced speech.
The autocorrelation of this error sequence should therefore exhibit a
large spike located at a distance from the origin equal to the pitch
period. This method, however, will occasionally fail, generating a
dominant peak at a distance other than the true pitch period. In these
cases, the inverse filter has not done an adequate job of removing
everything. Or, stated another way, the curve fit was insufficient.
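The peak-picking step of this technique can be sketched as below. The
function name `autocorr_pitch`, the lag search range, and the idealized
impulse-train residual are assumptions made for illustration; a real
residual would be noisier and may need a voicing decision as well.

```python
from typing import Sequence


def autocorr_pitch(err: Sequence[float], min_lag: int, max_lag: int) -> int:
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    n = len(err)
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(err[i] * err[i + lag] for i in range(n - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


# Idealized residual: unit impulses every 25 samples in an 80-sample window.
err = [1.0 if i % 25 == 0 else 0.0 for i in range(80)]
period = autocorr_pitch(err, min_lag=5, max_lag=60)
```

With a clean impulse train the largest peak falls at the true period.
When the inverse filter leaves structure in the residual, a peak at a
multiple of the period can dominate instead, which is the failure mode
described above.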

A solution  to  this  problem  is  to  generate  the  error  sequence  using 
STREAK  rather  than  from  conventional  block  analysis  methods.  This 
approach  results  in  a superior  least  squares  curve  fit  since  the  fit  is 
applied  on  a sample  by  sample  basis  rather  than  over  an  entire  block  of 
N samples. 

To  illustrate  the  improvement  in  inverse  filtering  using  STREAK 
over  the  block  analysis  method,  the  next  section  compares  the  forward 
prediction  error  sequences  for  various  phonemes  using  the  two  methods. 

Comparison  of  Inverse  Filter  Outputs 

The forward prediction error sequence using the block analysis
method was generated using twelve k-parameters estimated from a 20 ms
Hanning window. The window was advanced in 20 ms steps. The sampling
rate was 8 kHz. The data was not preemphasized. Twelve k-parameters
were also used in generating the error sequence using the STREAK
algorithm.

For each comparison three figures are presented: the original
waveform, the error sequence using the block analysis, and the error
sequence using STREAK. The entire word "oak" as spoken by a
low-pitched male in the context "Oak is strong ..." is displayed in
Figure III.3 (C), along with the prediction error using the block
analysis method in Figure III.3 (A) and using STREAK in Figure
III.3 (B). Note that the error sequence using STREAK exhibits a
considerably flatter spectral character than the block analysis error
sequence.

In Figure III.4 (C) the nasal /n/ from the word "friends" spoken
in the context "Thieves who rob friends deserve jail" is displayed.
Figure III.4 (A) displays the block analysis error sequence and Figure
III.4 (B) displays the error sequence using STREAK. Note the absence
of periodicity at the trailing end of the nasal in Figure III.4 (A),
whereas with STREAK, the pitch period is clearly evident.

Finally, Figure III.5 (C) displays the voiced stop /b/ followed
by the semivowel /r/ in the word "break" spoken in the context "Don't
break the glass". Again, Figure III.5 (A) displays the prediction
error using block analysis and Figure III.5 (B) using STREAK.
Examining the error waveform during both the /b/ and /r/ sections shows
again that STREAK produces a spectrally flatter error sequence, with
the fundamental more clearly evident.

These  three  examples  were  taken  from  a large  group  of  sentences 
spoken  by  both  male  and  female  speakers.  For  every  sentence  analyzed, 
the  two  error  sequences  exhibited  the  same  general  characteristics 
with  the  STREAK  algorithm  always  producing  the  superior  curve  fit. 

Implementing  STREAK  into  a Pitch  Tracking  Algorithm 
In Chapter I a pitch detection method was described in which a
new sequence was formed by prefiltering and down-sampling the original
speech waveform (see Figure I.3). The pitch period was estimated by
autocorrelating the error sequence generated from the smoothed,
down-sampled waveform, and locating the distance of the largest
positive peak. STREAK can be incorporated into this algorithm by simply
replacing the procedure that estimates the inverse filter and generates
the error sequence using the block analysis method with the error
sequence generator using STREAK. All other procedures are left
unchanged. An example comparing the results of the two methods is
shown in Figures III.6 through III.10. Figure III.6 shows the smoothed,
down-sampled waveform from which the pitch estimate is to be determined.
This speech sample is from the phoneme /o/ in "four". It was obtained
by low pass filtering an 8 kHz sampled waveform at 750 Hz and
down-sampling by four. Thus, the eighty samples represent a 40 ms
window. The error sequence using the block analysis approach is shown
in Figure III.7. This sequence was generated by a four pole inverse
filter using predictor coefficients estimated from the eighty sample
window. Figure III.8 shows the error sequence generated by STREAK from
the same data using a fourth order filter. Figures III.9 and III.10
show the autocorrelations of each of these error sequences. Note that a
pitch doubling error results when the block analysis is used but that
the correct pitch period of 25 will be chosen when the STREAK analysis
is used.

SUMMARY AND CONCLUSIONS


A technique for recursively estimating the k-parameters of the
Linear Predictive Coding inverse filter has been developed. The
k-parameters are estimated directly from the lattice form of the
inverse filter. The criterion for estimation was that a value for
$k_m$ be calculated so as to minimize the sum of the squares of the
(m+1)st forward and backward prediction error sequences. New estimates
of each k-parameter are calculated at each sample point.

It was shown that the least squares curve fit using this method
was superior to that using the block analysis method and therefore
that this method may improve pitch detection schemes based upon the
autocorrelation of the inverse filter output.